Abstract

We determine an explicit Gröbner basis, consisting of linear forms and determinantal
quadrics, for the prime ideal of Raftery's mixture transition distribution model for
Markov chains. When the states are binary, the corresponding projective variety is a
linear space, the model itself consists of two simplices in a cross-polytope, and the
likelihood function typically has two local maxima. In the general non-binary case, the
model corresponds to a cone over a Segre variety.

1 Introduction

In this note we investigate Adrian Raftery's mixture transition distribution model (MTD)
from the perspective of algebraic statistics jU IB]. The MTD model, which was first
proposed in [9], has a wide range of applications in engineering and the sciences [10].
The article by Berchtold and Raftery [2] offers a detailed introduction and review.

The point of departure for this project was a conjecture due to Donald Richards
stating that the likelihood function of an MTD model can have multiple local maxima.
We establish this conjecture for the case of binary states in Proposition 6.

Our main result, to be derived in Section 4, gives an explicit Gröbner basis for the
MTD model. Here, both the sequence length and the number of states are arbitrary.

We begin with an algebraic description of the model in [2, 9]. Fix a pair of positive
integers $l$ and $m$, and set $N = m^{l+1} - 1$. We define the statistical model
$\mathrm{MTD}_{l,m}$ whose state space is the set $[m]^{l+1}$ of sequences
$i_0 i_1 \cdots i_l$ of length $l+1$ over the alphabet $[m] = \{1, 2, \ldots, m\}$.
The model has $(m-1)m + l - 1$ parameters, given by the entries of an $m \times m$
transition matrix $(q_{ij})$ and a probability distribution
$\lambda = (\lambda_1, \ldots, \lambda_l)$ on the set $[l] = \{1, 2, \ldots, l\}$ of the
hidden states. Thus the parameter space is the product of simplices
$(\Delta_{m-1})^m \times \Delta_{l-1}$. The model $\mathrm{MTD}_{l,m}$ will be a
semialgebraic subset of the simplex $\Delta_N$. That simplex has its coordinates
$p_{i_0 i_1 \cdots i_l}$ indexed by sequences in $[m]^{l+1}$.

The model $\mathrm{MTD}_{l,m}$ is the image of the bilinear map

$$\phi_{l,m} : (\Delta_{m-1})^m \times \Delta_{l-1} \to \Delta_N$$

which is defined by the formula

$$p_{i_0 i_1 \cdots i_l} \;=\; \frac{1}{m^l} \sum_{j=1}^{l} \lambda_j \, q_{i_{j-1} i_l}. \tag{1}$$

As is customary in algebraic statistics, we pass to a simpler object of study by considering
the Zariski closure $\overline{\mathrm{MTD}}_{l,m}$ of our model in the complex projective
space $\mathbb{P}^N$, and we seek to compute the homogeneous prime ideal of all polynomials
in the $N+1$ unknowns $p_{i_0 i_1 \cdots i_l}$ that vanish on $\mathrm{MTD}_{l,m}$. This
particular goal will be reached in our Theorem 8.

The following probabilistic interpretation of the formula (1) makes it evident that
$\sum_{i} p_{i_0 i_1 \cdots i_l} = 1$ holds on the image of $\phi_{l,m}$. We generate a
sequence of length $l+1$ on $m$ states as follows. First we select from the uniform
distribution on all $m^l$ sequences $i_0 i_1 \cdots i_{l-1}$ of length $l$. All that remains
is to determine the state $i_l$ in position $l$. The mixture distribution $\lambda$
determines which of the earlier states gets used in the transition. With probability
$\lambda_j$, we select position $j-1$ for that. The character in the last position $l$ is
determined from the state $i_{j-1}$ in position $j-1$ using the transition matrix $(q_{ij})$.
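
The sampling description above can be turned into a small computation. The following is a minimal sketch, assuming formula (1) in the form $p_{i_0 \cdots i_l} = m^{-l} \sum_j \lambda_j q_{i_{j-1} i_l}$; the function name `mtd_distribution` and the numerical values of `q` and `lam` are hypothetical and chosen only for illustration. Exact rational arithmetic is used so that the total mass can be compared with 1 exactly.

```python
from fractions import Fraction
from itertools import product

def mtd_distribution(q, lam):
    """Return {sequence: probability} for the MTD model.

    q   : m x m row-stochastic transition matrix (list of lists)
    lam : mixture weights (lambda_1, ..., lambda_l), summing to 1
    Sequences are tuples (i_0, ..., i_l) with entries in 1..m.
    """
    m, l = len(q), len(lam)
    scale = Fraction(1, m ** l)  # uniform choice of the first l states
    p = {}
    for seq in product(range(1, m + 1), repeat=l + 1):
        last = seq[-1]
        # lam[j] is lambda_{j+1}; it selects position j, whose state is seq[j]
        p[seq] = scale * sum(lam[j] * q[seq[j] - 1][last - 1] for j in range(l))
    return p

# Hypothetical parameters for l = 2, m = 3 (each row of q sums to 1).
q = [[Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)],
     [Fraction(1, 5), Fraction(3, 5), Fraction(1, 5)],
     [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]]
lam = [Fraction(1, 3), Fraction(2, 3)]
p = mtd_distribution(q, lam)
total = sum(p.values())
```

Here `total` is the sum of all $m^{l+1}$ coordinates, which the interpretation above says must equal 1.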

The model $\mathrm{MTD}_{l,m}$ is known to be identifiable [2, §4.2]. Consequently, the
dimension of the projective variety $\overline{\mathrm{MTD}}_{l,m}$ is equal to the number
$(m-1)m + l - 1$ of model parameters. A geometric characterization of this variety will be
given in Corollary 11.

Equations defining Markov chains and Hidden Markov Models have received consid- 
erable attention in algebraic statistics [21 El El [12]. We contribute to this literature by 
studying the algebraic geometry of a fundamental model for higher order Markov chains. 
In addition to our theoretical results in Theorems 1 and 8, readers from statistics will find
in Section 3 an analysis of the behavior of the EM algorithm for binary MTD models. 

2 Binary States 

Our first result concerns the geometry of the model in the case m = 2 of binary states. 

Theorem 1. The variety $\overline{\mathrm{MTD}}_{l,2}$ is a linear subspace of dimension
$l+1$ in the projective space $\mathbb{P}^N$. This variety intersects the probability simplex
$\Delta_N$ in a regular cross-polytope of dimension $l+1$. The model $\mathrm{MTD}_{l,2}$ is
the union of two $(l+1)$-simplices spanned by vertices of the cross-polytope
$\overline{\mathrm{MTD}}_{l,2} \cap \Delta_N$. The two simplices meet along a common edge.

The cross-polytope is the free object in the category of centrally symmetric polytopes
[13]. It can be represented as the convex hull of all signed unit vectors $e_i$ and $-e_i$,
where $i = 0, 1, \ldots, l$, so it is an $(l+1)$-dimensional polytope with $2l+2$ vertices
and $2^{l+1}$ facets.

Before we come to the proof of Theorem 1, let us first see some examples to illustrate
it. In what follows we abbreviate the model parameters by $q_{11} = a$, $q_{21} = b$ and
$\lambda_2 = \lambda$.

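Example 2. Theorem 1 also applies in the trivial case $l = 1$, where (1) reads

$$(a, b) \;\mapsto\; \frac{1}{2} \begin{pmatrix} a & 1-a \\ b & 1-b \end{pmatrix}. \tag{2}$$

The variety $\overline{\mathrm{MTD}}_{1,2}$ is the plane in $\mathbb{P}^3$ given by
$p_{11} + p_{12} = p_{21} + p_{22}$. Its intersection with the tetrahedron $\Delta_3$
coincides with the model $\mathrm{MTD}_{1,2}$, which is a regular square:

$$\mathrm{MTD}_{1,2} \;=\; \overline{\mathrm{MTD}}_{1,2} \cap \Delta_3 \;=\; \operatorname{conv}\left\{
\tfrac{1}{2}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},\;
\tfrac{1}{2}\begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix},\;
\tfrac{1}{2}\begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix},\;
\tfrac{1}{2}\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \right\}.$$

The first three and last three matrices in this list form the two triangles referred to in
Theorem 1. Their common edge consists of all transition matrices (2) of rank 1. □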

Example 3. Our first non-trivial example arises for $l = m = 2$. The map $\phi_{2,2}$ is
given by

$$(a, b, \lambda) \;\mapsto\; \tfrac{1}{4}\Bigl( a\,e_{111} + (\lambda b + (1-\lambda)a)\,e_{121} + (\lambda a + (1-\lambda)b)\,e_{211} + b\,e_{221} + (1-a)\,e_{112}$$
$$\qquad +\, (\lambda(1-b) + (1-\lambda)(1-a))\,e_{122} + (\lambda(1-a) + (1-\lambda)(1-b))\,e_{212} + (1-b)\,e_{222} \Bigr).$$

Here $\{e_{111}, e_{112}, \ldots, e_{222}\}$ denotes the standard basis in the space of
$2 \times 2 \times 2$-tensors. The variety $\overline{\mathrm{MTD}}_{2,2}$ is the
3-dimensional linear subspace of $\mathbb{P}^7$ defined by

$$p_{111} + p_{112} = p_{121} + p_{122}, \qquad p_{211} + p_{212} = p_{221} + p_{222},$$
$$p_{121} + p_{122} = p_{221} + p_{222}, \qquad p_{111} + p_{221} = p_{121} + p_{211}.$$

The intersection of this linear space with the simplex $\Delta_7$ is the regular octahedron
whose vertices are the images under $\phi_{2,2}$ of the vertices of the cube
$(\Delta_1)^2 \times \Delta_1$. The model $\mathrm{MTD}_{2,2}$ consists of two tetrahedra
formed by vertices of the octahedron. Their common edge is the segment between
$\tfrac{1}{4}(e_{111} + e_{121} + e_{211} + e_{221})$ and
$\tfrac{1}{4}(e_{112} + e_{122} + e_{212} + e_{222})$. □
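
The claims of Example 3 can be spot-checked by direct computation. The sketch below assumes the parametrization $p_{i_0 i_1 i_2} = \tfrac14\bigl((1-\lambda)\,q_{i_0 i_2} + \lambda\,q_{i_1 i_2}\bigr)$ with $q_{11} = a$ and $q_{21} = b$; the helper names `phi22` and `linear_relations_hold` and the sample parameter values are hypothetical. It checks the four linear relations at one rational parameter point and counts the distinct images of the eight cube vertices.

```python
from fractions import Fraction
from itertools import product

def phi22(a, b, lam):
    """Image of (a, b, lambda) under the map of Example 3 (l = m = 2)."""
    q = {(1, 1): a, (1, 2): 1 - a, (2, 1): b, (2, 2): 1 - b}
    return {(i0, i1, i2): Fraction(1, 4) * ((1 - lam) * q[i0, i2] + lam * q[i1, i2])
            for i0, i1, i2 in product((1, 2), repeat=3)}

def linear_relations_hold(p):
    """Check the four linear equations listed in Example 3."""
    return (p[1, 1, 1] + p[1, 1, 2] == p[1, 2, 1] + p[1, 2, 2]
            and p[2, 1, 1] + p[2, 1, 2] == p[2, 2, 1] + p[2, 2, 2]
            and p[1, 2, 1] + p[1, 2, 2] == p[2, 2, 1] + p[2, 2, 2]
            and p[1, 1, 1] + p[2, 2, 1] == p[1, 2, 1] + p[2, 1, 1])

# One arbitrary rational parameter point (illustrative values only).
sample = phi22(Fraction(2, 7), Fraction(5, 6), Fraction(1, 3))

# Images of the 8 vertices of the cube; equal images collapse in the set.
images = {tuple(sorted(phi22(Fraction(a), Fraction(b), Fraction(t)).items()))
          for a, b, t in product((0, 1), repeat=3)}
```

The set `images` has one element per distinct vertex image, so its size can be compared with the six vertices of an octahedron.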

Example 4. The statement of Theorem 1 does not extend to $m \ge 3$. Consider the case
$l = 2$, $m = 3$. The 7-dimensional variety $\overline{\mathrm{MTD}}_{2,3}$ lives in
$\mathbb{P}^{26}$, and it is not a linear space. The linear span of
$\overline{\mathrm{MTD}}_{2,3}$ is 10-dimensional. Inside this $\mathbb{P}^{10}$, the variety
$\overline{\mathrm{MTD}}_{2,3}$ has codimension 3, degree 4, and it is cut out by six
quadrics. In Example 10 we shall display a Gröbner basis consisting of 16 linear forms and
six quadrics for its prime ideal. □

Proof of Theorem 1. It is known by [2, §4.2] that the model is identifiable, so
$\mathrm{MTD}_{l,2}$ is a semi-algebraic set of dimension $l+1$ in $\Delta_N$. Its Zariski
closure $\overline{\mathrm{MTD}}_{l,2}$ is a variety of dimension $l+1$ in $\mathbb{P}^N$.
That variety is irreducible because it is defined by way of a rational parametrization. For
any binary sequence $i_0 i_1 \cdots i_{l-1}$, the identity

$$p_{i_0 i_1 \cdots i_{l-1} 2} \;=\; \frac{1}{2^l} - p_{i_0 i_1 \cdots i_{l-1} 1} \tag{3}$$

holds on $\mathrm{MTD}_{l,2}$, so it suffices to consider relations on probabilities of
sequences that end with 1. On our model, these probabilities satisfy the linear equations

$$p_{i_0 \cdots i_r \cdots i_s \cdots i_{l-1} 1} + p_{i_0 \cdots j_r \cdots j_s \cdots i_{l-1} 1} \;=\; p_{i_0 \cdots i_r \cdots j_s \cdots i_{l-1} 1} + p_{i_0 \cdots j_r \cdots i_s \cdots i_{l-1} 1}. \tag{4}$$

In other words, the $l$-dimensional $2 \times 2 \times \cdots \times 2$-tensor
$(p_{i_0 i_1 \cdots i_{l-1} 1})$ has tropical rank 1. The set of such tensors is a classical
linear space of dimension $l+1$.

Solving the linear equations (3) and (4) on the simplex $\Delta_N$, we obtain an
$(l+1)$-dimensional polytope $P$ that contains the model $\mathrm{MTD}_{l,2}$. Its Zariski
closure in $\mathbb{P}^N$ is an $(l+1)$-dimensional linear space that contains the variety
$\overline{\mathrm{MTD}}_{l,2}$. Being irreducible varieties of the same dimension, they must
be equal. This proves the first assertion.

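We next claim that the polytope $P$ of all non-negative real solutions to (3) and (4)
is a regular cross-polytope. For $r \in \{0, 1, \ldots, l-1\}$ and $s \in \{1, 2\}$ define the
$2l$ points

$$E_{r1} = \frac{1}{2^l}\bigl(e_{+\cdots+1+\cdots+1} + e_{+\cdots+2+\cdots+2}\bigr), \qquad
E_{r2} = \frac{1}{2^l}\bigl(e_{+\cdots+1+\cdots+2} + e_{+\cdots+2+\cdots+1}\bigr),$$

where the displayed non-final index sits in position $r$, and a subscript $+$ indicates that
the basis vectors are summed over both values of that index. These are extreme non-negative
solutions of (3) and (4). They form the vertices of an $l$-dimensional cross-polytope, since
$\tfrac{1}{2}(E_{r1} + E_{r2})$ is equal to the uniform distribution
$\tfrac{1}{2^{l+1}} e_{++\cdots+}$ for all $r$. In addition to the $2l$ vertices $E_{rs}$, the
polytope $P$ has two more vertices, namely $\tfrac{1}{2^l} e_{++\cdots+1}$ and
$\tfrac{1}{2^l} e_{++\cdots+2}$. Hence $P$ is a bipyramid over the $l$-dimensional
cross-polytope, so it is an $(l+1)$-dimensional cross-polytope.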

It remains to identify the model $\mathrm{MTD}_{l,2}$ inside $P$. The parameter polytope is
the product $(\Delta_1)^2 \times \Delta_{l-1}$, and, as before, we chose coordinates $(a, b)$
on the square $(\Delta_1)^2$. The map $\phi_{l,2}$ contracts the simplex
$\{(0,0)\} \times \Delta_{l-1}$ onto the vertex $\tfrac{1}{2^l} e_{++\cdots+2}$ of $P$, and
it contracts the simplex $\{(1,1)\} \times \Delta_{l-1}$ onto the vertex
$\tfrac{1}{2^l} e_{++\cdots+1}$ of $P$. The vertex $(0,1) \times e_r$ is mapped to the vertex
$E_{r2}$, and the vertex $(1,0) \times e_r$ is mapped to the vertex $E_{r1}$. The parameter
points with $a = b$ are contracted onto the line segment
$S = [\tfrac{1}{2^l} e_{++\cdots+1}, \tfrac{1}{2^l} e_{++\cdots+2}]$. The parameter points
with $a < b$ are mapped bijectively onto the $(l+1)$-simplex formed by $S$ and
$\{E_{0,2}, E_{1,2}, \ldots, E_{l-1,2}\}$, but with $S$ removed. The parameter points with
$a > b$ are mapped bijectively onto the $(l+1)$-simplex formed by $S$ and
$\{E_{0,1}, E_{1,1}, \ldots, E_{l-1,1}\}$, but with $S$ removed. Hence $\mathrm{MTD}_{l,2}$
equals the union of two $(l+1)$-simplices glued along the special diagonal $S$ of the
cross-polytope $P$. □

Corollary 5. For large $l$, there are far fewer distributions in the model
$\mathrm{MTD}_{l,2}$ than distributions in its Zariski closure. Namely, with respect to
Lebesgue measure, we have

$$\frac{\mathrm{vol}(\mathrm{MTD}_{l,2})}{\mathrm{vol}(\overline{\mathrm{MTD}}_{l,2} \cap \Delta_N)} \;=\; \frac{1}{2^{l-1}}.$$

Proof. We can triangulate the cross-polytope $P$ into $2^l$ simplices, all of the same volume
and containing the special diagonal $S$. The model $\mathrm{MTD}_{l,2}$ consists of two of
them. Hence $2/2^l$ is the fraction of the volume of
$P = \overline{\mathrm{MTD}}_{l,2} \cap \Delta_N$ that is occupied by
$\mathrm{MTD}_{l,2}$. □



3 Likelihood inference 

We next discuss maximum likelihood estimation (MLE) for the mixture transition distribution
model $\mathrm{MTD}_{l,m}$. Any data set is represented by a function
$u : [m]^{l+1} \to \mathbb{N}$ that records the frequency counts of the observed sequences.
Given such a function $u$, our objective is to maximize the corresponding log-likelihood
function

$$L_u \;=\; \sum_{i_0 i_1 \cdots i_l} u_{i_0 i_1 \cdots i_l} \cdot \log\bigl(p_{i_0 i_1 \cdots i_l}\bigr) \tag{5}$$

over all probability distributions that lie in the model $\mathrm{MTD}_{l,m}$. A standard method
for solving this optimization problem is the expectation-maximization (EM) algorithm. 
Other algorithms for the same task can be found in [TJ ITU] . 

A general version of the EM algorithm for algebraic models with discrete data is
described in [8, §1.3], while the specific case of the MTD model is treated in [2, §4.5].
Richards [11] conjectured that the EM algorithm for the MTD model may get stuck in
local maxima. Our next result confirms that this is indeed the case, even for $m = 2$.

Proposition 6. The log-likelihood function $L_u$ on the binary model $\mathrm{MTD}_{l,2}$
has either one or two local maxima. With probability one, there will be two local maxima,
and both of these will be reached by the EM algorithm for different choices of initial
parameters.

Here the statement about "probability one" in the second sentence refers to any
absolutely continuous probability distribution that is positive on the simplex $\Delta_N$.

Proof. We saw in Theorem 1 that $\mathrm{MTD}_{l,2}$ is the union of two convex polytopes.
The log-likelihood function $L_u$ is strictly concave on the ambient simplex $\Delta_N$, so
it attains a unique maximum on each of the two polytopes. This proves the first statement.

For the second statement consider the empirical distribution $u/|u|$, which is a point
in $\Delta_N$. Its log-likelihood function $L_u$ has a unique maximum $p^*$ in the interior
of the cross-polytope $P$. With probability one, this maximum $p^*$ will not lie in the
segment $S$, so let us assume that this is the case. Then either $p^*$ lies in precisely one
of the two $(l+1)$-simplices that make up $\mathrm{MTD}_{l,2}$, or $p^*$ does not lie in
$\mathrm{MTD}_{l,2}$. In the former case, $p^*$ is the MLE, and the maximum over the other
simplex is in the boundary of that simplex and constitutes a second local maximum. In the
latter case, each of the two simplices has a local maximum in its boundary. When choosing
starting parameter values near either of these local maxima, the EM algorithm converges to
that local maximum. □
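
To experiment with the behavior described in Proposition 6, one can run EM from several starting points. The following is a minimal sketch for the binary model with sequences of length three, written as the usual EM recipe for a mixture (posterior weights for the hidden position, then renormalised expected counts); it is an assumption that this matches the formulation in [2, §4.5]. The function name `em_binary_mtd` and the counts in `u` are hypothetical. No particular limit points are claimed here; which local maximum a run approaches depends on the data and on the start.

```python
import math

def em_binary_mtd(u, a, b, lam, steps=200):
    """EM for the binary MTD model on sequences i0 i1 i2 over {1, 2}.

    u    : dict mapping (i0, i1, i2) to a positive count
    a, b : starting values for q_11 and q_21
    lam  : starting value for lambda_2 (so lambda_1 = 1 - lam)
    Returns the final (a, b, lam) and the log-likelihood before each update.
    """
    trace = []
    for _ in range(steps):
        q = {(1, 1): a, (1, 2): 1 - a, (2, 1): b, (2, 2): 1 - b}
        weights = (1 - lam, lam)
        position = [0.0, 0.0]                  # expected use of positions 0 and 1
        transition = {key: 0.0 for key in q}   # expected transition counts
        loglik = 0.0
        for seq, count in u.items():
            terms = [weights[j] * q[seq[j], seq[2]] for j in range(2)]
            total = sum(terms)                 # equals 4 * p_seq by formula (1)
            loglik += count * math.log(total / 4)
            for j in range(2):                 # E-step: posterior weight of position j
                share = count * terms[j] / total
                position[j] += share
                transition[seq[j], seq[2]] += share
        trace.append(loglik)
        # M-step: normalise the expected counts
        lam = position[1] / (position[0] + position[1])
        a = transition[1, 1] / (transition[1, 1] + transition[1, 2])
        b = transition[2, 1] / (transition[2, 1] + transition[2, 2])
    return (a, b, lam), trace

# Hypothetical counts, chosen only for illustration.
u = {(1, 1, 1): 10, (1, 1, 2): 3, (1, 2, 1): 4, (1, 2, 2): 8,
     (2, 1, 1): 6, (2, 1, 2): 7, (2, 2, 1): 2, (2, 2, 2): 9}
fit_high, trace_high = em_binary_mtd(u, 0.8, 0.2, 0.5)   # start with a > b
fit_low, trace_low = em_binary_mtd(u, 0.2, 0.8, 0.5)     # start with a < b
```

Comparing `fit_high` with `fit_low`, and the last entries of the two traces, shows whether the two starts ended at different stationary points for the chosen data. The traces themselves should be non-decreasing, which is the standard monotonicity property of EM.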

The point p* in the cross-polytope P at which L u attains its maximum is an algebraic 
function of the data u. The degree of this algebraic function is the ML degree (see [TJ) of 
the linear subvariety $\overline{\mathrm{MTD}}_{l,2}$ of $\mathbb{P}^N$. By Varchenko's
Formula [8, Theorem 1.5], this ML degree coincides with the number of bounded regions in an
arrangement of hyperplanes. This arrangement lives inside the affine space that is cut out by
(3) and (4), and it consists of the restrictions of the coordinate hyperplanes
$\{p_i = 0\}$.

Computations show that the ML degree equals 9 for $l = 3$, and it equals 209 for $l = 4$.
It would be interesting to find a general formula for that ML degree as a function of $l$.

The local maxima that occur on the boundary of the two simplices of $\mathrm{MTD}_{l,2}$
have ML degree 1, that is, they are expressed as rational functions in the data $u$. Indeed,
these local maxima are precisely the estimates for the Markov chain obtained by fixing
$\lambda_i = 1$ for some $i$. Hence, if $p^* \notin \mathrm{MTD}_{l,2}$, then the MLE is a
rational expression in $u$. The next example illustrates the behavior of the EM algorithm for
$m = 2$ and $l = 3$.

Example 7. The data consists of eight positive integers, here written as a matrix

$$U \;=\; \begin{pmatrix} u_{111} & u_{121} & u_{211} & u_{221} \\ u_{112} & u_{122} & u_{212} & u_{222} \end{pmatrix}.$$

The MLE $\hat p$ will be either

$$p' \;=\; \frac{1}{2|u|} \begin{pmatrix}
u_{111}+u_{211} & u_{121}+u_{221} & u_{111}+u_{211} & u_{121}+u_{221} \\
u_{112}+u_{212} & u_{122}+u_{222} & u_{112}+u_{212} & u_{122}+u_{222}
\end{pmatrix}$$

or

$$p'' \;=\; \frac{1}{2|u|} \begin{pmatrix}
u_{111}+u_{121} & u_{111}+u_{121} & u_{211}+u_{221} & u_{211}+u_{221} \\
u_{112}+u_{122} & u_{112}+u_{122} & u_{212}+u_{222} & u_{212}+u_{222}
\end{pmatrix}$$

or it will be the unique probability distribution satisfying the linear equations (3) and (4)
and the rank condition

$$\operatorname{rank} \begin{pmatrix}
u_{111} & u_{112} & u_{121} & u_{122} & u_{211} & u_{212} & u_{221} & u_{222} \\
p_{111} & p_{112} & p_{121} & p_{122} & p_{211} & p_{212} & p_{221} & p_{222} \\
p_{111} & p_{112} & -p_{121} & -p_{122} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & p_{211} & p_{212} & -p_{221} & -p_{222} \\
0 & 0 & p_{121} & p_{122} & 0 & 0 & -p_{221} & -p_{222} \\
p_{111} & 0 & -p_{121} & 0 & -p_{211} & 0 & p_{221} & 0
\end{pmatrix} \;\le\; 5. \tag{6}$$

This is the matrix considered in [7, §3]. The rank constraint (6) represents Proposition 2
in [7]. The unique probability distribution that lies in our model and also satisfies (6) was
called $p^*$ in the proof of Proposition 6. Its defining constraints (3), (4) and (6) form a
system of polynomial equations that has 9 complex solutions. The distribution $p^*$ is the
unique solution to that system whose coordinates are both real and positive.

The trichotomy in this example is best explained by the following observations: For
almost all data matrices $U$, the three points $p', p'', p^*$ are distinct, one of them
coincides with the global maximum $\hat p$ of $L_u$ over $\mathrm{MTD}_{l,2}$, and another
one is a local maximum. □

It would be interesting to extend the findings in Proposition 6 to $m \ge 3$. The algebraic
tools that may be needed for such an analysis are developed in the next section.



4 Non-linear Models 



In this section we examine the geometry of the model $\mathrm{MTD}_{l,m}$ and the variety
$\overline{\mathrm{MTD}}_{l,m}$ for an arbitrary number $m$ of states. In particular, we
prove that its prime ideal is minimally generated by linear forms and quadrics. These minimal
generators form a Gröbner basis.

Theorem 8. The variety $\overline{\mathrm{MTD}}_{l,m}$ spans a linear space of dimension
$(m-1)(lm - l + 1)$ in $\mathbb{P}^N$. In this linear space, its prime ideal is given by the
$2 \times 2$-minors of an $l \times (m-1)^2$-matrix of linear forms. The linear and quadratic
ideal generators form a Gröbner basis.

This theorem explains our earlier result that the model is linear for binary states.
Indeed, for $m = 2$, the dimension $(m-1)m + l - 1$ of the model coincides with the
dimension $(m-1)(lm - l + 1)$ of the ambient linear space, and there are no
$2 \times 2$-minors.

Proof. We shall present an explicit Gröbner basis consisting of linear forms and quadrics.
The term order we choose is the reverse lexicographic term order induced by the
lexicographic order on the states $i_0 i_1 \cdots i_l$ of the model. We first consider the
linear relations

$$\underline{p_{i_0 i_1 \cdots i_{l-1} i_l}} \;-\; \sum_{j=0}^{l-1} p_{m \cdots m\, i_j\, m \cdots m\, i_l} \;+\; (l-1)\, p_{m m \cdots m m\, i_l}. \tag{7}$$

Here the index $i_j$ in the sum sits in position $j$. This linear form is non-zero and has
the underlined leading term if and only if at least two of the entries of the $l$-tuple
$(i_0, i_1, \ldots, i_{l-1})$ are not equal to $m$. Thus the number of distinct Gröbner basis
elements (7) equals $m^{l+1} - m(1 + l(m-1))$.

Our second class of Gröbner basis elements consists of the linear relations

$$\underline{p_{m \cdots m\, i_j\, m \cdots m\, 1}} + p_{m \cdots m\, i_j\, m \cdots m\, 2} + \cdots + p_{m \cdots m\, i_j\, m \cdots m\, m} \;-\; p_{m \cdots m m m \cdots m\, 1} - \cdots - p_{m \cdots m m m \cdots m\, m}. \tag{8}$$

These linear forms are non-zero with the underlined leading term provided
$0 \le j \le l-1$ and $1 \le i_j \le m-1$. The number of distinct linear forms (8) equals
$l(m-1)$, and the set of their leading terms is disjoint from the set of leading terms in (7).

The number of unknowns $p_i$ not yet underlined equals $l(m-1)^2 + (m-1) + 1$. We use
these unknowns to form $m-1$ matrices $A_2, A_3, \ldots, A_m$, each having format
$l \times (m-1)$, as follows. Define the matrix $A_r$ by placing the following entry in row
$j$ and column $i_j$:

$$\underline{p_{m \cdots m\, i_j\, m \cdots m\, r}} \;-\; p_{m \cdots m m m \cdots m\, r}. \tag{9}$$

We finally form an $l \times (m-1)^2$ matrix by concatenating these $m-1$ matrices:

$$A \;=\; \bigl( A_2 \;\; A_3 \;\; \cdots \;\; A_m \bigr). \tag{10}$$
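
The bookkeeping in this proof is easy to tabulate. The sketch below is only an arithmetic check of the counts stated in the text, namely $m^{l+1} - m(1 + l(m-1))$ linear forms (7), $l(m-1)$ linear forms (8), and $l(m-1)^2 + (m-1) + 1$ remaining unknowns; the function name `basis_counts` is hypothetical. It also counts the $2 \times 2$-minors of an $l \times (m-1)^2$ matrix.

```python
from math import comb

def basis_counts(l, m):
    """Tabulate the counts used in the proof of Theorem 8."""
    n_unknowns = m ** (l + 1)                       # coordinates of P^N
    forms7 = m ** (l + 1) - m * (1 + l * (m - 1))   # linear forms (7)
    forms8 = l * (m - 1)                            # linear forms (8)
    free = l * (m - 1) ** 2 + (m - 1) + 1           # unknowns not underlined
    minors = comb(l, 2) * comb((m - 1) ** 2, 2)     # 2x2 minors of an l x (m-1)^2 matrix
    return n_unknowns, forms7, forms8, free, minors

counts = basis_counts(2, 3)
```

For $l = 2$, $m = 3$ the three middle counts add up to the 27 unknowns, since every unknown is either a leading term of exactly one linear form or remains free.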

The third and last group of polynomials in our Gröbner basis is the set of
$2 \times 2$-minors of $A$. The entries of $A$ have distinct leading terms, underlined in
(9), and the leading term of each $2 \times 2$-minor is the product of the leading terms on
the main diagonal.

Note that we could also define the matrix $A_1$ and include it when forming (10). This
would not change the ideal, but it would lead to a generating set that is not minimal.

It is well-known that the $2 \times 2$-minors of a matrix of unknowns form a Gröbner basis
for the prime ideal they generate. Since no unknown $p_i$ underlined in (7) or (8) appears
in the matrix $A$, it follows that these linear relations together with the
$2 \times 2$-minors of (10) generate a prime ideal and form a Gröbner basis for that prime
ideal.

The ideal of $2 \times 2$-minors of $A$ has codimension $l(m-1)^2 - l - (m-1)^2 + 1$.
Subtracting this quantity from the number $l(m-1)^2 + (m-1) + 1$ of unknowns not underlined
in (7) or (8), we obtain $l + (m-1)^2 - 1 + (m-1) + 1 = (m-1)m + l$. This is the dimension of
the affine variety defined by our prime ideal. The corresponding irreducible projective
variety has dimension $(m-1)m + l - 1$. This is precisely the dimension of
$\overline{\mathrm{MTD}}_{l,m}$.

It hence suffices to prove that our variety contains the model $\mathrm{MTD}_{l,m}$, or,
equivalently, that the linear forms (7) and (8) are mapped to zero by the parameterization
(1), and that the specialized matrix $\phi^*_{l,m}(A)$ has rank 1. For (8) this is obvious
because, for fixed $i_j$,

$$\sum_{r=1}^{m} \phi^*_{l,m}\bigl(p_{m \cdots m\, i_j\, m \cdots m\, r}\bigr) \;=\; \frac{1}{m^l}.$$

Here $\phi^*_{l,m}$ denotes the homomorphism of polynomial rings induced by the map
$\phi_{l,m}$.

The indices of the unknowns in the linear form (7) all have the same letter $i_l$ in the
end. The formula (1) for the corresponding probabilities can thus be written as

$$p_{i_0 i_1 \cdots i_{l-1} i_l} \;=\; \frac{1}{m^l}\bigl( \lambda_1 q_{i_0 i_l} + \lambda_2 q_{i_1 i_l} + \cdots + \lambda_l q_{i_{l-1} i_l} \bigr).$$

In other words, for any fixed $i_l$, the resulting $l$-dimensional tensor has tropical rank 1.
This representation implies linear relations like (4), and these are equivalent to (7).

Finally, if we apply our ring homomorphism to (9) then we get

$$\phi^*_{l,m}\bigl(p_{m \cdots m\, i_j\, m \cdots m\, r}\bigr) - \phi^*_{l,m}\bigl(p_{m \cdots m m m \cdots m\, r}\bigr) \;=\; \frac{\lambda_{j+1}}{m^l} \cdot \bigl( q_{i_j, r} - q_{m, r} \bigr). \tag{11}$$

Thus, the matrix $\phi^*_{l,m}(A)$ is $m^{-l}$ times the product of the column vector
$(\lambda_1, \ldots, \lambda_l)$ and a row vector of length $(m-1)^2$ whose entries are
$q_{i_j, r} - q_{m, r}$ for $2 \le r \le m$ and $1 \le i_j \le m-1$. In particular, the
matrix $\phi^*_{l,m}(A)$ has rank $\le 1$. This completes the proof of Theorem 8. □

Remark 9. The prime ideal in Theorem 8 is the kernel of $\phi^*_{l,m}$, so it characterizes
the image of the model parametrization $\phi_{l,m}$. On the model $\mathrm{MTD}_{l,m}$, the
map $\phi_{l,m}$ can be inverted as long as the rows of the transition matrix $(q_{ij})$ are
distinct. Indeed, $q_{ij}$ equals $m^l\, p_{i i \cdots i j}$, and the coordinates of
$\lambda$ are identified from (11). Thus, our result refines the well-known fact that MTD
models are identifiable [2, §4.2].

Example 10. We illustrate Theorem 8 for the case $l = 2$, $m = 3$, by presenting the
Gröbner basis promised in Example 4. Note that $N = 26$. Here the ambient linear space has
dimension $(m-1)(lm - l + 1) = 10$, and our Gröbner basis for that linear space consists of
twelve linear forms (7) and four linear forms (8). These are, respectively,

$$\begin{array}{llll}
p_{111}-p_{311}-p_{131}+p_{331}, & p_{121}-p_{321}-p_{131}+p_{331}, & p_{211}-p_{311}-p_{231}+p_{331}, & p_{221}-p_{321}-p_{231}+p_{331}, \\
p_{112}-p_{312}-p_{132}+p_{332}, & p_{122}-p_{322}-p_{132}+p_{332}, & p_{212}-p_{312}-p_{232}+p_{332}, & p_{222}-p_{322}-p_{232}+p_{332}, \\
p_{113}-p_{313}-p_{133}+p_{333}, & p_{123}-p_{323}-p_{133}+p_{333}, & p_{213}-p_{313}-p_{233}+p_{333}, & p_{223}-p_{323}-p_{233}+p_{333},
\end{array}$$

and

$$\begin{array}{ll}
p_{311}+p_{312}+p_{313}-p_{331}-p_{332}-p_{333}, & p_{321}+p_{322}+p_{323}-p_{331}-p_{332}-p_{333}, \\
p_{131}+p_{132}+p_{133}-p_{331}-p_{332}-p_{333}, & p_{231}+p_{232}+p_{233}-p_{331}-p_{332}-p_{333}.
\end{array}$$

The remaining $l(m-1)^2 + (m-1) + 1 = 8 + 2 + 1 = 11$ not yet underlined unknowns are
$p_{132}, p_{232}, p_{312}, p_{322}, p_{133}, p_{233}, p_{313}, p_{323}, p_{332}, p_{333}, p_{331}$.
These represent coordinates on the linear subspace $\mathbb{P}^{10}$ of $\mathbb{P}^{26}$
that is cut out by these linear forms. Inside that linear subspace $\mathbb{P}^{10}$, our
variety $\overline{\mathrm{MTD}}_{2,3}$ has codimension 3, and it is defined
ideal-theoretically by the $2 \times 2$-minors of the $2 \times 4$-matrix

$$A \;=\; \bigl( A_2 \;\; A_3 \bigr) \;=\; \begin{pmatrix}
p_{132}-p_{332} & p_{232}-p_{332} & p_{133}-p_{333} & p_{233}-p_{333} \\
p_{312}-p_{332} & p_{322}-p_{332} & p_{313}-p_{333} & p_{323}-p_{333}
\end{pmatrix}.$$

These six quadrics, together with the 16 linear forms, form a reduced Gröbner basis. □
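
The sixteen linear forms and six quadrics of Example 10 can be evaluated on a point of the model as a sanity check. The following sketch assumes the parametrization $p_{i_0 i_1 r} = \tfrac19\bigl(\lambda_1 q_{i_0 r} + \lambda_2 q_{i_1 r}\bigr)$ for $l = 2$, $m = 3$; the helper name `mtd_2_3` and the parameter values are hypothetical. Exact fractions are used, so vanishing can be tested with equality.

```python
from fractions import Fraction
from itertools import combinations, product

def mtd_2_3(q, lam1):
    """Probabilities p[i0, i1, r] of the MTD model with l = 2, m = 3."""
    lam = (lam1, 1 - lam1)
    return {(i0, i1, r): Fraction(1, 9) * (lam[0] * q[i0 - 1][r - 1] + lam[1] * q[i1 - 1][r - 1])
            for i0, i1, r in product((1, 2, 3), repeat=3)}

# Hypothetical row-stochastic transition matrix and mixture weight.
q = [[Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)],
     [Fraction(1, 5), Fraction(3, 5), Fraction(1, 5)],
     [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]]
p = mtd_2_3(q, Fraction(2, 5))

# Twelve linear forms (7): both i0 and i1 differ from m = 3.
forms7 = [p[i0, i1, r] - p[3, i1, r] - p[i0, 3, r] + p[3, 3, r]
          for i0, i1, r in product((1, 2), (1, 2), (1, 2, 3))]

# Four linear forms (8): one non-final entry differs from m = 3.
forms8 = [sum(p[prefix + (r,)] - p[3, 3, r] for r in (1, 2, 3))
          for prefix in ((3, 1), (3, 2), (1, 3), (2, 3))]

# The 2 x 4 matrix A = (A_2 A_3) and its six 2 x 2 minors.
row0 = [p[i, 3, r] - p[3, 3, r] for r in (2, 3) for i in (1, 2)]
row1 = [p[3, i, r] - p[3, 3, r] for r in (2, 3) for i in (1, 2)]
minors = [row0[c] * row1[d] - row0[d] * row1[c]
          for c, d in combinations(range(4), 2)]
```

All entries of `forms7`, `forms8` and `minors` should be zero for any choice of stochastic `q` and weight `lam1`, which is the containment of the model in the variety used in the proof of Theorem 8.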

Our proof of Theorem 8 gives rise to the following geometric description:



Corollary 11. The projective variety $\overline{\mathrm{MTD}}_{l,m}$ is a cone with base
$\mathbb{P}^{m-1}$ over the Segre variety $\mathbb{P}^{l-1} \times \mathbb{P}^{m^2-2m}$. If
$m \ge 3$, then this variety is singular and its singular locus is the $\mathbb{P}^{m-1}$
that forms the base of that cone. The degree of $\overline{\mathrm{MTD}}_{l,m}$ equals
$\binom{m^2 - 2m + l - 1}{l - 1}$.



Proof. The ideal of the singular locus of $\overline{\mathrm{MTD}}_{l,m}$ is generated by
the entries of the matrix $A$ together with the linear forms (7) and (8). Together, these
linear equations are equivalent to requiring that the value of
$p_{i_0 i_1 \cdots i_{l-1} r}$ depends only on $r$. It does not depend on
$i_0 i_1 \cdots i_{l-1}$. These constraints define a linear space $\mathbb{P}^{m-1}$ in
$\mathbb{P}^N$. The $2 \times 2$-minors of an $l \times (m-1)^2$ matrix define the Segre
variety $\mathbb{P}^{l-1} \times \mathbb{P}^{m^2-2m}$, whose degree is known to be the
binomial coefficient. □
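
The numerical invariants in Corollary 11 can be cross-checked against the earlier examples. The sketch below uses the standard fact that the Segre variety $\mathbb{P}^a \times \mathbb{P}^b$ has dimension $a + b$ and degree $\binom{a+b}{a}$, and that a cone with a $\mathbb{P}^{m-1}$ as vertex raises the dimension by $m$ without changing the degree; the function name `cone_invariants` is hypothetical.

```python
from math import comb

def cone_invariants(l, m):
    """Dimension and degree of a cone with vertex P^(m-1) over P^(l-1) x P^(m^2-2m)."""
    a, b = l - 1, m * m - 2 * m
    segre_dim = a + b
    dim = segre_dim + m          # joining with P^(m-1) adds (m-1) + 1 dimensions
    degree = comb(a + b, a)      # degree of the Segre variety, unchanged by the cone
    return dim, degree

dim_2_3, degree_2_3 = cone_invariants(2, 3)
```

For $l = 2$, $m = 3$ this can be compared with Example 4, which reports a 7-dimensional variety of degree 4, and in general the dimension can be compared with the parameter count $(m-1)m + l - 1$.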